
Conversation

@weixiao-huang
Contributor

This MR makes process_weights_after_loading reusable for FP8 quantization.


@gemini-code-assist bot left a comment


Code Review

This pull request refactors weight processing for FP8 quantization to support weight updates, primarily by introducing a _wrap_parameter_or_copy helper function. This is a good change for compatibility with CUDA graphs. The change in kv_cache.py also improves robustness by ensuring quantization scales are always present. However, I've found a critical issue in Fp8MoEMethod.process_weights_after_loading where a parameter is not correctly unwrapped, leading to a no-op update and incorrect behavior in certain code paths. I've also suggested an improvement in kv_cache.py to make the code more robust by removing some overly strict assertions.
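For readers without the diff at hand, here is a minimal sketch of what a wrap-or-copy helper along these lines could look like. Only the first line of the real signature appears later in this thread; the rest of the signature and the body below are assumptions, not the PR's actual code.

import torch

def _wrap_parameter_or_copy(layer: torch.nn.Module, name: str,
                            tensor: torch.Tensor) -> None:
    # Hypothetical sketch: when a parameter with a matching shape already
    # exists, copy into its storage so pointers captured by CUDA graphs stay
    # valid across weight updates; otherwise register the processed tensor
    # as a fresh, non-trainable Parameter.
    existing = getattr(layer, name, None)
    if isinstance(existing, torch.nn.Parameter) and existing.shape == tensor.shape:
        existing.data.copy_(tensor)
    else:
        setattr(layer, name, torch.nn.Parameter(tensor, requires_grad=False))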

Comment on lines +761 to +766

critical

In the else branch of the conditional starting at line 733, the variables w2_weight and w2_weight_scale_inv are assigned torch.nn.Parameter objects on lines 755-756, instead of their underlying tensor data. Consequently, these calls to _wrap_parameter_or_copy become no-ops due to self-copying, which is likely not the intended behavior and can lead to incorrect weight updates.

This is inconsistent with how w13_weight is handled in the same block, which correctly uses .data. To fix this, you should modify lines 755-756 to extract the tensor data, like so:

# In vllm/model_executor/layers/quantization/fp8.py, lines 755-756
w2_weight = layer.w2_weight.data
w2_weight_scale_inv = layer.w2_weight_scale_inv.data

Since the fix is outside the diff, I'm placing this comment here to highlight this critical issue.

Comment on lines +53 to +55

high

These assertions could make the code brittle. If another part of the codebase modifies these attributes partially (e.g., removes q_scale but not k_scale), these assertions will fail. The main goal here is to ensure all weights are present if any are missing. Simply checking for q_scale and then creating all weights is sufficient and more robust against unforeseen state changes.
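A minimal sketch of the suggested check-and-create pattern (hypothetical helper name; the scale attribute names come from this discussion, and the default value of 1.0 is an assumption rather than the actual kv_cache.py default):

import torch

def _ensure_kv_scales(layer: torch.nn.Module) -> None:
    # Rather than asserting that every scale is absent, check one sentinel
    # attribute and (re)create the full set whenever it is missing.
    if not hasattr(layer, "q_scale"):
        for name in ("q_scale", "k_scale", "v_scale"):
            setattr(layer, name,
                    torch.nn.Parameter(torch.tensor(1.0), requires_grad=False))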

@youkaichao marked this pull request as draft September 9, 2025 07:40
…dd missing scale attributes

Signed-off-by: huangweixiao <huangweixiao@msh.team>
@faresobeid

Update?

return x.stride(0) == m * n and x.stride(1) == 1 and x.stride(2) == m


def _wrap_parameter_or_copy(layer: torch.nn.Module, name: str,

@kylesayrs Nov 11, 2025


As of torch 2.8 (and torch 2.7 using this PR), torch.compile supports parameter subclasses. Given this, all that should be required is updating the newly processed (possibly padded) weight; a new Parameter need not be created.


Agreed. In fact, the second branch in this code never gets triggered. All that is needed is to clean up the param = Parameter(...) statements in fp8.py that drop the weight loaders.
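For illustration, a hedged sketch of the cleanup both comments describe (w13_weight is taken from earlier in this review; the processing step is a placeholder, not the actual fp8.py code):

# Re-wrapping creates a fresh Parameter and drops attributes attached to the
# original, such as its weight_loader:
#     layer.w13_weight = torch.nn.Parameter(processed, requires_grad=False)
# Keeping the existing Parameter object (and any parameter subclass) and only
# swapping in the processed, possibly padded, tensor avoids that:
processed = process_w13(layer.w13_weight.data)  # placeholder processing step
layer.w13_weight.data = processed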
